
feat: support --scale-down-delay-after-* per nodegroup #5729

Conversation

vadasambar
Member

@vadasambar vadasambar commented May 5, 2023

Signed-off-by: vadasambar surajrbanakar@gmail.com

What type of PR is this?

/kind feature

What this PR does / why we need it:

We want to support --scale-down-delay-after-* per nodegroup so that if nodegroup A scales up, it doesn't block nodegroup B from scaling down (which is the case now). For details, check #3071

Which issue(s) this PR fixes:

Fixes #3071

Special notes for your reviewer:

This PR is still in the WIP/PoC phase. The plan is to get an okay from the community before proceeding with the approach in this PR. If you have any feedback on the PR, I would love to hear it. #5729 (comment)

Does this PR introduce a user-facing change?

Added: `--scale-down-delay-type-local` flag. It specifies whether the `--scale-down-delay-after-*` flags should be applied locally per nodegroup or globally across all nodegroups.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


P.S.

I had originally intended to support only --scale-down-delay-after-add, hence the name of the PR branch vadasambar:feat/3071/scale-down-after-add-per-ng-poc. But I have since increased the scope of the PR to support all --scale-down-delay-after-* flags per nodegroup.
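To make the intended behaviour concrete, here is a minimal sketch of the local-vs-global cooldown decision; the function and variable names (`inScaleDownCooldown`, `lastScaleUpPerGroup`, `lastGlobalScaleUp`) are placeholders for illustration, not the implementation in this PR:

```go
package sketch

import "time"

// inScaleDownCooldown is an illustrative sketch (not the code in this PR):
// with the global behaviour, any recent scale-up blocks scale-down everywhere;
// with --scale-down-delay-type-local, only the node group that scaled up
// recently is blocked from scaling down.
func inScaleDownCooldown(nodeGroupID string, scaleDownDelayTypeLocal bool,
	lastScaleUpPerGroup map[string]time.Time, lastGlobalScaleUp time.Time,
	scaleDownDelayAfterAdd time.Duration, now time.Time) bool {
	if scaleDownDelayTypeLocal {
		last, ok := lastScaleUpPerGroup[nodeGroupID]
		return ok && last.Add(scaleDownDelayAfterAdd).After(now)
	}
	return lastGlobalScaleUp.Add(scaleDownDelayAfterAdd).After(now)
}
```

With the local behaviour, a scale-up of nodegroup A no longer delays scale-down of nodegroup B.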

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 5, 2023
@k8s-ci-robot k8s-ci-robot requested a review from feiskyer May 5, 2023 04:39
@k8s-ci-robot k8s-ci-robot requested a review from x13n May 5, 2023 04:39
@x13n
Member

x13n commented May 5, 2023

/assign

@vadasambar vadasambar changed the title feat: wip PR to support --scale-down-delay-after-add per nodegroup feat: wip PR to support --scale-down-delay-after-* per nodegroup May 10, 2023

if lastScaledUps[nodeGroup.Id()] != nil &&
    lastScaledUps[nodeGroup.Id()].ExpectedAddTime.After(currentTime) {
    klog.V(4).Infof("Skipping scale down on node group %s because it was scaled up recently at %v", nodeGroup.Id(), lastScaledUps[nodeGroup.Id()].Time)
Member Author

I have set the log level at v4 but that might not be the best thing to do. It might help to have this at v2 so that the users know what is happening around scale down. Any feedback on this is welcome 🙏

@vadasambar vadasambar requested a review from x13n May 18, 2023 04:30
@vadasambar vadasambar force-pushed the feat/3071/scale-down-after-add-per-ng-poc branch 2 times, most recently from 4a22c5e to 24f850b on May 23, 2023 04:11
@vadasambar
Member Author

vadasambar commented Nov 3, 2023

Took longer than I thought. Sorry to everyone who is interested in this PR 🙏

@x13n I think the PR is ready for review. I didn't think it would grow so big. It might be difficult to review as a whole. I'm open to splitting it if you want (I haven't thought too deeply about how, so I am not sure of the effort involved). If it helps, I am happy to schedule a call or go over the PR in the SIG meeting.

@vadasambar
Member Author

@x13n wanted to remind you about this PR 🙏

@x13n
Member

x13n commented Nov 23, 2023

Apologies for the lack of activity here, I didn't have a long enough time block recently to review a large PR like this one. This is in my queue, I'll try to get to it in the next couple of days.

@vadasambar
Member Author

> Apologies for the lack of activity here, I didn't have a long enough time block recently to review a large PR like this one. This is in my queue, I'll try to get to it in the next couple of days.

Noted. If it decreases your load, I can do first-pass reviews on some of the PRs that you haven't reviewed yet and you can do the second pass. I recently started reviewing PRs hoping to become a reviewer (maybe not as good as you :D).

@x13n
Member

x13n commented Nov 23, 2023

Sadly my load doesn't only come from PR reviews :) I do appreciate the offer to review other PRs though, this repo would certainly benefit from having more reviewers!

Member

@x13n x13n left a comment

Thanks for the PR! Added a bunch of comments.

limitations under the License.
*/

package observers
Member

I don't particularly like the idea of having a generic observers package. If we keep on adding different observers here, it will become a mix of completely unrelated objects, except for the fact that they implement the observer pattern. Let's have a specific observer for a specific use case, in a specific package.

Member Author

I think there is value in having observers under the same package because it makes them easy to identify (just like the processors directory). Maybe something like this makes more sense?

├── observers
│   └── nodegroupchange
│       └── scale_state_observer.go

This makes the import path ../cluster-autoscaler/observers/nodegroupchange. We'd still have a specific package called nodegroupchange while letting the user know that it's an observer.

If we go with the specific observer -> specific use case -> specific package approach, the name of the package is going to be nodegroupchangeobserver, which IMO is a bit too long. We could shorten it to ngchangeobserver or ngcobserver, but then you need to know that ng stands for nodegroup (or that ngc stands for nodegroupchange).

I'm open to other, better approaches as well (I am leaning more towards the first). Let me know what you think.

Member

Grouping under the same observers directory, but having separate packages, sounds acceptable to me. I disagree that we need to have observer as part of the package name though: referring to the object externally as nodegroupchange.Observer should read well; there's no need to repeat the same word in the package name and in the type names within the package.

cluster-autoscaler/observers/scale_state_observer.go (outdated; resolved)
cluster-autoscaler/config/const.go (outdated; resolved)
// RegisterFailedScaleUp records failed scale-up for a nodegroup.
RegisterFailedScaleUp(nodeGroup cloudprovider.NodeGroup, reason string, gpuResourceName, gpuType string, currentTime time.Time) error
// RegisterFailedScaleDown records failed scale-down for a nodegroup.
RegisterFailedScaleDown(nodeGroup cloudprovider.NodeGroup, reason string, currentTime time.Time) error
Member

No implementation of RegisterFailedScaleDown actually returns non-nil errors, so let's drop the return value from the interface. Also, observers are supposed to observe, not interfere with execution, so it actually makes sense for all of these functions to not return anything.

Member Author

I have removed the code to return errors for all the functions except RegisterScaleUp, because I want to use it as a replacement for RegisterOrUpdateScaleUp and return an error in case a negative delta is passed to RegisterScaleUp. More info in #5729 (comment)

Open to any other options/ideas.

Member

Checking if an int is negative is simple enough to be done before making a call to RegisterScaleUp. It would then be possible to remove the error return type.
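For illustration, the shape being converged on here might look roughly like the sketch below; the method signatures (other than the two quoted above) and the `notifyScaleUp` helper are assumptions for the example, not the merged API:

```go
package nodegroupchange

import (
	"fmt"
	"time"

	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// NodeGroupChangeObserver (sketch): observers only observe, so the
// notification methods return nothing.
type NodeGroupChangeObserver interface {
	// RegisterScaleUp records a scale-up of delta nodes for a nodegroup.
	RegisterScaleUp(nodeGroup cloudprovider.NodeGroup, delta int, currentTime time.Time)
	// RegisterFailedScaleUp records a failed scale-up for a nodegroup.
	RegisterFailedScaleUp(nodeGroup cloudprovider.NodeGroup, reason string, gpuResourceName, gpuType string, currentTime time.Time)
	// RegisterFailedScaleDown records a failed scale-down for a nodegroup.
	RegisterFailedScaleDown(nodeGroup cloudprovider.NodeGroup, reason string, currentTime time.Time)
}

// notifyScaleUp validates the delta on the caller side, so RegisterScaleUp
// itself does not need an error return (hypothetical helper for this sketch).
func notifyScaleUp(obs NodeGroupChangeObserver, ng cloudprovider.NodeGroup, delta int, now time.Time) error {
	if delta < 0 {
		return fmt.Errorf("negative scale-up delta %d for node group %s", delta, ng.Id())
	}
	obs.RegisterScaleUp(ng, delta, now)
	return nil
}
```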

@@ -344,6 +347,250 @@ func TestStaticAutoscalerRunOnce(t *testing.T) {
mock.AssertExpectationsForObjects(t, onScaleUpMock)
}

func TestStaticAutoscalerRunOnceWithScaleDownDelayPerNG(t *testing.T) {
Member

Could you make this test table-based? It looks like it is sharing a lot of code with other tests in this file.

Member Author

Updated the code. Let me know what you think.
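For readers unfamiliar with the request, a table-driven test in Go typically follows the shape below; the test name, cases, and the `inCooldown` stand-in are illustrative, not the actual test added in this PR:

```go
package sketch

import "testing"

// inCooldown is a stand-in for the logic under test in this sketch.
func inCooldown(local bool) bool { return !local }

// TestInCooldown shows the table-driven shape: one slice of named cases,
// one loop, one t.Run per case.
func TestInCooldown(t *testing.T) {
	cases := []struct {
		name  string
		local bool
		want  bool
	}{
		{name: "global delay keeps scale-down blocked", local: false, want: true},
		{name: "local delay does not block other nodegroups", local: true, want: false},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			if got := inCooldown(tc.local); got != tc.want {
				t.Errorf("inCooldown(%v) = %v, want %v", tc.local, got, tc.want)
			}
		})
	}
}
```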

cluster-autoscaler/processors/processors.go (outdated; resolved)
time.Now())
currentTime); err != nil {
return errors.ToAutoscalerError(errors.InternalError, err)
}
Member

Potential future follow-up comment (this PR is already huge so I wouldn't do it here): metrics and events exporting done below look like great candidates for becoming observers of the same notification as well.


currentTime := time.Now()

if !p.scaleUps[nodeGroup.Id()].IsZero() &&
Member

The following 3 blocks of code are almost identical, which looks like an opportunity to share a bit of code.

Member Author

@vadasambar vadasambar Dec 18, 2023

Member

Thanks, looks better. I'm not sure if recent is the best name here, but I don't have a better idea :)

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 30, 2023
@vadasambar
Member Author

Thank you for the review @x13n. I'm a little bogged down with multiple things. I will mention you here once I get through addressing all the comments 🙏

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 14, 2023
recent := func(m map[string]time.Time, d time.Duration, msg string) bool {
    if !m[nodeGroup.Id()].IsZero() && m[nodeGroup.Id()].Add(d).After(currentTime) {
        klog.V(4).Infof("Skipping scale down on node group %s because it %s recently at %v",
            msg, nodeGroup.Id(), m[nodeGroup.Id()])
Member

msg and nodeGroup.Id() arguments should be swapped to fit the format string.

Member Author

Good catch. Swapped them now.
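After the swap, the helper reads roughly as follows (a reconstruction of the snippet above; the `recentFor` wrapper and its parameters exist only to make the sketch self-contained and are not part of the PR):

```go
package sketch

import (
	"time"

	"k8s.io/klog/v2"
)

// recentFor returns a helper that reports whether the given nodegroup has an
// entry in m within duration d of currentTime, logging why scale-down is
// skipped. Note nodeGroupID now comes before msg, matching the format string.
func recentFor(nodeGroupID string, currentTime time.Time) func(m map[string]time.Time, d time.Duration, msg string) bool {
	return func(m map[string]time.Time, d time.Duration, msg string) bool {
		if !m[nodeGroupID].IsZero() && m[nodeGroupID].Add(d).After(currentTime) {
			klog.V(4).Infof("Skipping scale down on node group %s because it %s recently at %v",
				nodeGroupID, msg, m[nodeGroupID])
			return true
		}
		return false
	}
}
```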



Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: update scale down status after every scale up
- move scaledown delay status to cluster state/registry
- enable scale down if  `ScaleDownDelayTypeLocal` is enabled
- add new funcs on cluster state to get and update scale down delay status
- use timestamp instead of booleans to track scale down delay status
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: use existing fields on clusterstate
- uses `scaleUpRequests`, `scaleDownRequests` and `scaleUpFailures` instead of `ScaleUpDelayStatus`
- changed the above existing fields a little to make them more convenient for use
- moved initializing scale down delay processor to static autoscaler (because clusterstate is not available in main.go)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove note saying only `scale-down-after-add` is supported
- because we are supporting all the flags
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: evaluate `scaleDownInCooldown` the old way only if `ScaleDownDelayTypeLocal` is set to `false`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove line saying `--scale-down-delay-type-local` is only supported for `--scale-down-delay-after-add`
- because it is not true anymore
- we are supporting all `--scale-down-delay-after-*` flags per nodegroup
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix clusterstate tests failing
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: move initializing processors logic back from static autoscaler to main
- we don't want to initialize processors in static autoscaler because anyone implementing an alternative to static_autoscaler has to initialize the processors
- and initializing specific processors is making static autoscaler aware of an implementation detail which might not be the best practice
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: revert changes related to `clusterstate`
- since I am going with observer pattern
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: add observer interface for state of scaling
- to implement observer pattern for tracking state of scale up/downs (as opposed to using clusterstate to do the same)
- refactor `ScaleDownCandidatesDelayProcessor` to use fields from the new observer
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove params passed to `clearScaleUpFailures`
- not needed anymore
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: revert clusterstate tests
- approach has changed
- I am not making any changes in clusterstate now
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: add accidentally deleted lines for clusterstate test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: implement `Add` fn for scale state observer
- to easily add new observers
- re-word comments
- remove redundant params from `NewDefaultScaleDownCandidatesProcessor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: CI complaining because no comments on fn definitions
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: initialize parent `ScaleDownCandidatesProcessor`
- instead  of `ScaleDownCandidatesSortingProcessor` and `ScaleDownCandidatesDelayProcessor` separately
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: add scale state notifier to list of default processors
- initialize processors for `NewDefaultScaleDownCandidatesProcessor` outside and pass them to the fn
- this allows more flexibility
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: add observer interface
- create a separate observer directory
- implement `RegisterScaleUp` function in the clusterstate
- TODO: resolve syntax errors
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: use `scaleStateNotifier` in place of `clusterstate`
- delete leftover `scale_stateA_observer.go` (new one is already present in `observers` directory)
- register `clusterstate` with `scaleStateNotifier`
- use `Register` instead of `Add` function in `scaleStateNotifier`
- fix `go build`
- wip: fixing tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix syntax errors
- add utils package `pointers` for converting `time` to pointer (without having to initialize a new variable)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: wip track scale down failures along with scale up failures
- I was tracking scale up failures but not scale down failures
- fix copyright year 2017 -> 2023 for the new `pointers` package
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: register failed scale down with scale state notifier
- wip writing tests for `scale_down_candidates_delay_processor`
- fix CI lint errors
- remove test file for `scale_down_candidates_processor` (there is not much to test as of now)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: wip tests for `ScaleDownCandidatesDelayProcessor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add unit tests for `ScaleDownCandidatesDelayProcessor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: don't track scale up failures in `ScaleDownCandidatesDelayProcessor`
- not needed
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: better doc comments for `TestGetScaleDownCandidates`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: don't ignore error in `NGChangeObserver`
- return it instead and let the caller decide what to do with it
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: change pointers to values in `NGChangeObserver` interface
- easier to work with
- remove `expectedAddTime` param from `RegisterScaleUp` (not needed for now)
- add tests for clusterstate's `RegisterScaleUp`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: conditions in `GetScaleDownCandidates`
- set scale down in cool down if the number of scale down candidates is 0
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: use `ng1` instead of `ng2` in existing test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: wip static autoscaler tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: assign directly instead of using `sdProcessor` variable
- variable is not needed
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: first working test for static autoscaler
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: continue working on static autoscaler tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: wip second static autoscaler test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove `Println` used for debugging
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add static_autoscaler tests for scale down delay per nodegroup flags
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: rebase off the latest `master`
- change scale state observer interface's `RegisterFailedScaleup` to reflect latest changes around clusterstate's `RegisterFailedScaleup` in `master`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix clusterstate test failing
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix failing orchestrator test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename `defaultScaleDownCandidatesProcessor` -> `combinedScaleDownCandidatesProcessor`
- describes the processor better
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: replace `NGChangeObserver` -> `NodeGroupChangeObserver`
- makes it easier to understand for someone not familiar with the codebase
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: reword code comment `after` -> `for which`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: don't return error from `RegisterScaleDown`
- not needed as of now (no implementer function returns a non-nil error for this function)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: address review comments around ng change observer interface
- change dir structure of nodegroup change observer package
- stop returning errors wherever it is not needed in the nodegroup change observer interface
- rename `NGChangeObserver` -> `NodeGroupChangeObserver` interface (makes it easier to understand)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: make nodegroupchange observer thread-safe
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add TODO to consider using multiple mutexes in nodegroupchange observer
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: use `time.Now()` directly instead of assigning a variable to it
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: share code for checking if there was a recent scale-up/down/failure
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: convert `ScaleDownCandidatesDelayProcessor` into table tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: change scale state notifier's `Register()` -> `RegisterForNotifications()`
- makes it easier to understand what the function does
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: replace scale state notifier `Register` -> `RegisterForNotifications` in test
- to fix syntax errors since it is already renamed in the actual code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove `clusterStateRegistry` from `delete_in_batch` tests
- not needed anymore since we have `scaleStateNotifier`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: address PR review comments
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: add empty `RegisterFailedScaleDown` for clusterstate
- fix syntax error in static autoscaler test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
@vadasambar vadasambar force-pushed the feat/3071/scale-down-after-add-per-ng-poc branch from 6aa4922 to 5de49a1 on January 11, 2024 16:17
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 11, 2024
@vadasambar vadasambar requested a review from x13n January 11, 2024 16:36
@vadasambar
Member Author

@x13n I have addressed the review comments. Can I get a review on this again 🙏

@x13n
Member

x13n commented Jan 12, 2024

Thanks! This was a lot of work, but I think it is now good to go.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 12, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vadasambar, x13n

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 12, 2024
@k8s-ci-robot k8s-ci-robot merged commit ffb54c8 into kubernetes:master Jan 12, 2024
6 checks passed
@WarpRat

WarpRat commented Jan 25, 2024

Big thanks to everyone who put the work in on this! Will this be backported into older versions of the autoscaler? We'd love to take advantage of this after the next release.

@vadasambar
Member Author

> Big thanks to everyone who put the work in on this! Will this be backported into older versions of the autoscaler? We'd love to take advantage of this after the next release.

cherry-pick to 1.29 is clean: #6484

There are a lot of merge conflicts in 1.28 (and I am assuming in 1.27 as well).

@Hexcles

Hexcles commented Oct 11, 2024

Labels
approved - Indicates a PR has been approved by an approver from all required OWNERS files.
area/cluster-autoscaler
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
lgtm - "Looks good to me", indicates that a PR is ready to be merged.
size/XL - Denotes a PR that changes 500-999 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Should --scale-down-delay-after-add be per-nodepool?